What fascinates us about animal behavior is its richness and complexity, but understanding
behavior and its neural basis requires a simpler description. Traditionally, simplification has been 
imposed by training animals to engage in a limited set of behaviors, by hand scoring behaviors 
into discrete classes, or by limiting the sensory experience of the organism. An alternative is to 
ask whether we can search through the dynamics of natural behaviors to find explicit evidence that 
these behaviors are simpler than they might have been. We review two mathematical approaches 
to simplification, dimensionality reduction and the maximum entropy method, and we draw on 
examples from different levels of biological organization, from the crawling behavior of C. elegans 
to the control of smooth pursuit eye movements in primates, and from the coding of natural scenes 
by networks of neurons in the retina to the rules of English spelling. In each case, we argue that the 
explicit search for simplicity uncovers new and unexpected features of the biological system, and 
that the evidence for simplification gives us a language with which to phrase new questions for the 
next generation of experiments. The fact that similar mathematical structures succeed in taming 
the complexity of very different biological systems hints that there is something more general to be 
discovered. 



I. INTRODUCTION 

The last decades have seen an explosion in our abil- 
ity to characterize the microscopic mechanisms — the 
molecules, cells, and circuits — that generate the behavior 
of biological systems. In contrast, our characterization 
of behavior itself has advanced much more slowly. Start- 
ing in the late nineteenth century, attempts to quantify 
behavior focused on experiments in which the behavior 
itself was restricted, for example by forcing an observer 
to choose among a limited set of alternatives. In the 
mid-twentieth century, ethologists emphasized the im- 
portance of observing behavior in its natural context, but 
here, too, the analysis most often focused on the counting 
of discrete actions. Parallel to these efforts, neurophysiol- 
ogists were making progress on how the brain represents 
the sensory world by presenting simplified stimuli and 
labeling cells by preference for stimulus features. 

Here we outline an approach in which living systems 
naturally explore a relatively unrestricted space of motor 
outputs or neural representations, and we search directly 
for simplification within the data. While there is often 
suspicion of attempts to reduce the evident complexity 
of the brain, it is unlikely that understanding will be 
achieved without some sort of compression. Rather than 
restricting behavior (or our description of behavior) from 
the outset, we will let the system "tell us" whether our 
favorite simplifications are successful. Furthermore, we 
start with high spatial and temporal resolution data since 
we do not know the simple representation ahead of time. 
This approach is made possible only by the combination 
of new experimental methods that generate larger, higher 
quality data sets with the application of mathematical 
ideas that have a chance of discovering unexpected simplicity
in these complex systems. We present four very
different examples where finding such simplicity informs
our understanding of biological function.



II. DIMENSIONALITY REDUCTION 

In the human body there are approximately 100 joint 
angles and substantially more muscles. Even if each muscle
has just two states (rest or tension), the number of
possible postures is enormous, $2^{N_{\rm muscles}} \sim 10^{30}$. If our
bodies moved aimlessly among these states, characteriz- 
ing our motor behavior would be hopeless — no experi- 
ment could sample even a tiny fraction of all the possible 
trajectories. Moreover, wandering in a high dimensional 
space is unlikely to generate functional actions that make 
sense in a realistic context. Indeed, it is doubtful that a 
plausible neural system would independently control all 
the muscles and joint angles without some coordinating 
patterns or "movement primitives" from which to build
a repertoire of actions. There have been several motor 
systems in which just such a reduction in dimensionality 
has been found [THS]. Here we present two examples of 
behavioral dimensionality reduction which represent very 
different levels of system complexity: smooth pursuit eye 
movements in monkeys and the free wiggling of worm-like 
nematodes. These examples are especially compelling as 
so few dimensions are required for a complete description 
of natural behavior. 



Smooth pursuit eye movements 

Movements are variable even if conditions are carefully 
repeated, but the origin of that variability is poorly understood.
Variation might arise from noise in sensory
processing to identify goals for movement, in planning
or generating movement commands, or in the mechanical
response of the muscles. The structure of behavioral
variation can inform our understanding of the underlying 
system if we can connect the dimensions of variation to 
a particular stage of neural processing. 
FIG. 1. The low-dimensional dynamics of pursuit eye velocity
trajectories [7]. (a) Eye movements were recorded from male
rhesus monkeys (Macaca mulatta) that had been trained to
fixate and track visual targets. Thin black and gray lines
represent H and V eye velocity in response to a step in target
motion on a single trial; dashed lines represent the corresponding
trial-averaged means. Red and blue lines represent the
model prediction. (b) Three natural modes of variation
corresponding to direction, speed and time provide an essentially
complete basis for eye trajectories. Black and gray curves
correspond to H and V components. (c) The covariance matrix
of the horizontal eye velocity trajectories. The yellow square
marks 125 ms during the fixation period prior to target motion
onset, the green square the first 125 ms of pursuit. The
color scale is in (deg/s)$^2$. (d) The eigenvalue spectrum of the
difference matrix $\Delta C(t,t') = C_{\rm pursuit}(t,t')$ (green square) $-$
$C_{\rm background}(t,t')$ (yellow square).



Like other types of movement, eye movements are po- 
tentially high dimensional if eye position and velocity 
vary independently from moment to moment. But an 
analysis of the natural variation in smooth pursuit eye 
movement behavior reveals a simple structure whose form 
suggests a neural origin for the noise that gives rise to be- 
havioral variation. Pursuit is a tracking eye movement, 
triggered by image motion on the retina, which serves 
to stabilize a target's retinal image and thus to prevent 
motion blur [6]. When a target begins to move rela- 
tive to the eye, the pursuit system interprets the result- 
ing image motion on the retina to estimate the target's 
trajectory and then to accelerate the eye to match the
target's motion direction and speed. While tracking on
longer time scales is driven by both retinal inputs and
by extra-retinal feedback signals, the initial ~125 ms of
the movement is generated purely from sensory estimates
of the target's motion, using visual inputs present before
the onset of the response. Focusing on just this initial
portion of the pursuit movement, we can express the eye
velocity in response to steps in target motion as a vector,
$\mathbf{v}(t) = v_H(t)\,\hat{\imath} + v_V(t)\,\hat{\jmath}$, where $v_H(t)$ and $v_V(t)$ are
the horizontal and vertical components of the velocity,
respectively (solid black and gray lines in Fig 1a). If the
initial 125 ms of eye movement is sampled every millisecond,
the pursuit trajectories have 250 dimensions (125 samples
for each of the two components).

We compute the covariance of fluctuations about the
mean trajectory, shown in Fig 1c. Focusing on a window
of 125 ms at the start of the pursuit response (green box),
we find that the first three eigenvalues of the covariance
matrix are larger than the rest, which we confirmed by
estimating the standard error of the values for each dataset
[7]. This low dimensional structure is not a limitation
of the motor system, since during fixation (yellow box)
there are 80 significant eigenvalues. Indeed, the small
amplitude, high dimensional variation visible during fixation
appears to be an ever present background noise
that is swamped by the larger fluctuations in movement
specific to pursuit. If the covariance of this background
noise is subtracted from the covariance during pursuit,
the 3 dimensional structure becomes essentially exact,
accounting for ~94% of variations in eye velocity.
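
The background subtraction described above can be sketched numerically. The Python example below is a minimal illustration on synthetic data, not the analysis code of Ref [7]: the array names, the three hypothetical mode shapes, the noise level, and the choice to measure the captured fraction using only the positive eigenvalues of the difference matrix are all assumptions made for the example.

```python
import numpy as np

def delta_cov_spectrum(pursuit, fixation):
    """Eigenvalues (descending) of C_pursuit - C_background.

    Both inputs have shape (n_trials, n_timepoints).
    """
    c_pursuit = np.cov(pursuit, rowvar=False)
    c_background = np.cov(fixation, rowvar=False)
    # The difference of two covariance matrices is symmetric, so eigvalsh applies.
    return np.linalg.eigvalsh(c_pursuit - c_background)[::-1]

def fraction_in_top(evals, k):
    """Share of the positive spectrum carried by the k largest eigenvalues."""
    positive = np.clip(evals, 0.0, None)
    return positive[:k].sum() / positive.sum()

# Synthetic stand-in for data (assumption): three smooth modes plus white noise.
rng = np.random.default_rng(0)
n_trials, n_t = 2000, 125
t = np.linspace(0.0, 1.0, n_t)
modes = np.stack([np.sin(np.pi * t), t**2, np.cos(np.pi * t)])
coeffs = rng.normal(size=(n_trials, 3)) * np.array([3.0, 2.0, 1.0])
pursuit = coeffs @ modes + 0.1 * rng.normal(size=(n_trials, n_t))
fixation = 0.1 * rng.normal(size=(n_trials, n_t))

evals = delta_cov_spectrum(pursuit, fixation)
frac3 = fraction_in_top(evals, 3)
print(f"fraction of variance in top 3 modes: {frac3:.3f}")
```

Because the difference of two sample covariance matrices need not be positive definite, some eigenvalues come out slightly negative; clipping them is one simple convention, and other choices are possible.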

How does low dimensionality in eye movement arise?
The goal of the movement is to match the eye to the
target's velocity, which is constant in these experiments.
The brain must therefore interpret the activity of sensory
neurons that represent its visual inputs, detecting that
the target has begun to move (at time $t_0$) and estimating
the direction $\theta$ and speed $v$ of motion. At best, the brain
estimates these quantities and transforms these estimates
into some desired trajectory of eye movements, which we
can write as $\bar{\mathbf{v}}(t; \hat{t}_0, \hat{\theta}, \hat{v})$, where $\hat{\cdot}$ denotes an estimate of
the quantity $\cdot$. But estimates are never perfect, so we
should imagine that $\hat{t}_0 = t_0 + \delta t_0$, and so on, where $\delta t_0$ is
the small error in the sensory estimate of target motion
onset on a single trial. If these errors are small, we can
write

$$
\mathbf{v}(t) = \bar{\mathbf{v}}(t; t_0, v, \theta)
+ \delta t_0 \frac{\partial \bar{\mathbf{v}}}{\partial t_0}
+ \delta\theta \frac{\partial \bar{\mathbf{v}}}{\partial \theta}
+ \delta v \frac{\partial \bar{\mathbf{v}}}{\partial v}
+ \delta\mathbf{v}_{\rm back}(t), \qquad (1)
$$

where the first term is the average eye movement made
in response to many repetitions of the target motion, the
next three terms describe the effects of the sensory errors,
and the final term is the background noise. Thus, if
we can separate out the effects of the background noise,
the fluctuations in $\mathbf{v}(t)$ from trial to trial should be
described by just three random numbers, $\delta t_0$, $\delta\theta$, and $\delta v$:
the variations should be three dimensional, as observed.
The partial derivatives in Eq (1) can be measured as
the difference between the trial-averaged pursuit trajectories
in response to slightly different target motions. In
fact the average trajectories vary in a simple way, shifting
along the $t$ axis as we change $t_0$, rotating in space
as we change $\theta$, and scaling uniformly faster or slower as
we change $v$ [7], so that the relevant derivatives can be
estimated just from one average trajectory. When the
dust settles, this means that we can write the covariance
of fluctuations around the mean pursuit trajectory as



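$$
C_{ij}(t,t') =
\begin{bmatrix} v^{(i)}_{\rm dir}(t) & v^{(i)}_{\rm speed}(t) & v^{(i)}_{\rm time}(t) \end{bmatrix}
\begin{bmatrix}
\langle\delta\theta\,\delta\theta\rangle & \langle\delta\theta\,\delta v\rangle & \langle\delta\theta\,\delta t_0\rangle \\
\langle\delta v\,\delta\theta\rangle & \langle\delta v\,\delta v\rangle & \langle\delta v\,\delta t_0\rangle \\
\langle\delta t_0\,\delta\theta\rangle & \langle\delta t_0\,\delta v\rangle & \langle\delta t_0\,\delta t_0\rangle
\end{bmatrix}
\begin{bmatrix} v^{(j)}_{\rm dir}(t') \\ v^{(j)}_{\rm speed}(t') \\ v^{(j)}_{\rm time}(t') \end{bmatrix}
+ C^{(\rm back)}_{ij}(t,t'), \qquad (2)
$$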



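where the terms $\langle\delta\theta\,\delta\theta\rangle$, $\langle\delta\theta\,\delta v\rangle$, etc. are the covariances
of the sensory errors, and we have abbreviated the partial
derivative expressions for the modes of variation as
$\mathbf{v}_{\rm dir} = \partial\bar{\mathbf{v}}(t; t_0, v, \theta)/\partial\theta$,
$\mathbf{v}_{\rm speed} = \partial\bar{\mathbf{v}}(t; t_0, v, \theta)/\partial v$,
and $\mathbf{v}_{\rm time} = \partial\bar{\mathbf{v}}(t; t_0, v, \theta)/\partial t_0$. The fact that $C$ can
be written in this form implies not only that the variations
in pursuit will be three dimensional, but that we
can predict in advance what these dimensions should be.
Experimentally we find that the three relevant dimensions
have 96% overlap with axes corresponding to $\mathbf{v}_{\rm dir}$,
$\mathbf{v}_{\rm speed}$ and $\mathbf{v}_{\rm time}$.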
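
As an illustration of how the three predicted modes can be built from a single average trajectory (a shift in time, a rotation in space, and a uniform scaling), the following Python sketch constructs them and compares them with the leading eigenvectors of a covariance matrix. Everything here is an assumption for illustration: the tanh onset profile, the noise magnitudes, and in particular the overlap measure (mean squared cosine of the principal angles between subspaces), which may differ from the measure used in Ref [7].

```python
import numpy as np

def predicted_modes(v_h, v_v, dt):
    """Rows: direction, speed and time modes built from one mean trajectory."""
    mean = np.concatenate([v_h, v_v])
    v_dir = np.concatenate([-v_v, v_h])   # small rotation of (v_H, v_V)
    v_speed = mean                        # uniform scaling, up to a factor 1/v
    # A later onset t_0 shifts the trajectory in time: d/dt_0 = -d/dt.
    v_time = -np.concatenate([np.gradient(v_h, dt), np.gradient(v_v, dt)])
    return np.stack([v_dir, v_speed, v_time])

def subspace_overlap(modes, eigvecs):
    """Mean squared cosine of the principal angles between two subspaces.

    modes has shape (k, D); eigvecs has shape (D, k).
    """
    q_pred, _ = np.linalg.qr(modes.T)
    q_data, _ = np.linalg.qr(eigvecs)
    cosines = np.linalg.svd(q_pred.T @ q_data, compute_uv=False)
    return float(np.mean(cosines**2))

def trajectory(t, t0, theta, speed, width=0.02):
    """Hypothetical smooth onset profile; returns concatenated (v_H, v_V)."""
    profile = 0.5 * (1.0 + np.tanh((t - t0) / width))
    return np.concatenate([speed * np.cos(theta) * profile,
                           speed * np.sin(theta) * profile])

# Synthetic trials: small errors in onset time, direction and speed (assumed sizes).
rng = np.random.default_rng(1)
dt = 1e-3
t = np.arange(125) * dt
t0, theta, speed = 0.06, 0.3, 20.0
trials = np.array([
    trajectory(t, t0 + rng.normal(0, 0.005), theta + rng.normal(0, 0.05),
               speed + rng.normal(0, 1.0))
    for _ in range(1000)
])

mean = trials.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(trials, rowvar=False))
top3 = evecs[:, -3:]                      # eigh sorts eigenvalues in ascending order
modes = predicted_modes(mean[:125], mean[125:], dt)
overlap = subspace_overlap(modes, top3)
print(f"subspace overlap: {overlap:.3f}")
```

In this toy setting the overlap is close to one by construction, since the trials were generated from exactly the three kinds of error that the modes describe; the interesting experimental finding is that real pursuit data behave the same way.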

These results strongly support the hypothesis that the 
observable variations in motor output are dominated by 
the errors that the brain makes in estimating the param- 
eters of its sensory inputs, as if the rest of the process- 
ing and motor control circuitry were effectively noiseless, 
or more precisely that they contribute only at the level 
of background variation in the movement. Further, the 
magnitude and time course of noise in sensory estimation 
are comparable to the noise sources that limit perceptual 
discrimination [7, 8]. This unexpected result challenges
our intuition that noise in the execution of movement cre- 
ates behavioral variation, and it forces us to consider that 
errors in sensory estimation may set the limit to behavioral
precision. Our findings are consistent with the idea
that the brain can minimize the impact of noise in motor
execution in a task specific manner [5J [TU] , although
they suggest a novel origin for that noise in the sensory
system. The precision of smooth pursuit fits well with 
the broader view that the nervous system can approach 
optimal performance at critical tasks [TTMl4| . 



The way the worm wiggles 
FIG. 2. The low-dimensional space of worm postures [15].
(a) We use tracking video microscopy to record images of the
worm's body at high spatiotemporal resolution as it crawls
along a flat agar surface. Dotted lines trace the worm's centroid
trajectory and the body outline and centerline skeleton
are extracted from the microscope image on a single frame.
(b) We characterize worm shape by the tangent angle $\theta$ vs. arc
length $s$ of the centerline skeleton. (c) We decompose each
shape into four dominant modes by projecting $\theta(s)$ along the
eigenvectors of the shape covariance matrix (eigenworms). (d,
black circles) The fraction of total variance captured by each
projection. The four eigenworms account for ~95% of the
variance within the space of shapes. (d, red diamonds) The
fraction of total variance captured when worm shapes are
represented by images of the worm's body; the low dimensionality
is hidden in this pixel representation.



The free motion of the nematode C. elegans on a flat
agar plate provides an ideal opportunity to quantify the
(reasonably) natural behavior of an entire organism [15].
Under such conditions, changes in the worm's sinuous
body shape support a variety of motor behaviors, including
forward and backward crawling and large body bends
known as $\Omega$-turns [16]. Tracking microscopy provides
high spatial and temporal resolution images of the worm
over long periods of time, and from these images we can
see that fluctuations in the thickness of the worm are
small, so most variations in the shape are captured by
the curve that passes through the center of the body.
We measure position along this curve (arc length) by the
variable $s$, normalized so that $s = 0$ is the head and
$s = 1$ is the tail. The position of the body element at
$s$ is denoted by $\mathbf{x}(s)$, but it is more natural to give an
"intrinsic" description of this curve in terms of the tangent
angle $\theta(s)$, removing our choice of coordinates by
rotating each image so that the mean value of $\theta$ along
the body always is zero. Sampling at $N = 100$ equally
spaced points along the body, each shape is described
completely by a 100-dimensional vector (Fig 2a, b).

As we did with smooth pursuit eye movements, we seek 
a low dimensional space that underlies the shapes we observe.
In the simplest case, this space is a Euclidean
projection of the original high dimensional space so that
the covariance matrix of angles,
$C(s,s') = \langle(\theta(s) - \langle\theta\rangle)(\theta(s') - \langle\theta\rangle)\rangle$,
will have only a small number of significant
eigenvalues. For C. elegans this is exactly what we
find, as shown in Fig 2c, d: over 95% of the variance in 
body shape is accounted for by projections along just four 
dimensions ('eigenworms', red curves in Fig 2c). Further, 
the trajectory in this low dimensional space of shapes 
predicts the motion of the worm over the agar surface 
[17] . Importantly, the simplicity that we find depends on 
our choice of initial representation. For example, if we 
take raw images of the worm's body, cropped to a mini- 
mum size (300 x 160 pixels) and aligned to remove rigid 
translations and rotations, the variance across images is 
spread over hundreds of dimensions. 
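
The eigenworm construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline of Ref [15]: the function names are hypothetical, the centerline is assumed to be already extracted as an array of (x, y) points ordered from head to tail, and the demonstration uses synthetic two-mode shapes rather than worm data.

```python
import numpy as np

def tangent_angles(xy):
    """Tangent angle along a centerline of shape (n_points + 1, 2).

    The mean angle is subtracted, which removes the overall orientation.
    """
    d = np.diff(xy, axis=0)
    theta = np.unwrap(np.arctan2(d[:, 1], d[:, 0]))
    return theta - theta.mean()

def eigenworms(thetas):
    """PCA of shapes; thetas has shape (n_frames, n_points).

    Returns eigenvalues (descending), eigenvectors (columns), and the
    cumulative fraction of variance captured.
    """
    cov = np.cov(thetas, rowvar=False)      # C(s, s') estimated over frames
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    return evals, evecs, np.cumsum(evals) / evals.sum()

# Synthetic shapes (assumption): two sinusoidal modes plus small noise.
rng = np.random.default_rng(2)
s = np.linspace(0.0, 1.0, 100)
amps = rng.normal(size=(5000, 2))
thetas = (amps[:, :1] * np.sin(2 * np.pi * s)
          + amps[:, 1:] * np.cos(2 * np.pi * s)
          + 0.05 * rng.normal(size=(5000, 100)))

evals, evecs, cumulative = eigenworms(thetas)
# Mode amplitudes: projections of mean-subtracted shapes on the leading eigenvectors.
amplitudes = (thetas - thetas.mean(axis=0)) @ evecs[:, :4]
print(f"variance captured by two modes: {cumulative[1]:.4f}")
```

Running the same decomposition on raw pixel images instead of tangent angles would be a direct way to see the dependence on representation noted above.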

The tangent angle representation and projections 
along the eigenworms provide a compact yet substantially 
complete description of worm behavior. In distinction to 
previous work (see e.g. [16j EH HSDi this description is 
naturally aligned to the organism, fully computable from 
the video images with no human intervention, and also 
simple. In the next section we show how these coordi- 
nates can be also used to explore dynamical questions 
posed by the behavior of C. elegans. 

Dynamics of worm behavior 

We have found low dimensional structure in the 
smooth pursuit eye movements of monkeys and in the free 
wiggling of nematodes. Can this simplification inform 
our understanding of behavioral dynamics — the emer- 
gence of discrete behavioral states, and the transitions 
between them? Here we use the trajectories of C. ele- 
gans in the low dimensional space to construct an ex- 
plicit stochastic model of crawling behavior, and then 
show how long-lived states and transitions between them 
emerge naturally from this model. 

Of the four dimensions in shape space that character- 
ize the crawling of C. elegans, motions along the first 
two combine to form an oscillation, corresponding to the 
wave which passes along the worm's body and drives it 
forward or backward. Here, we focus on the phase of this 
oscillation, $\phi = \tan^{-1}(a_2/a_1)$, where $a_1$ and $a_2$ are the amplitudes along the first two eigenworms (Fig 3a), and construct,
from the observed trajectories, a stochastic dynamical 
system, analogous to the Langevin equation for a Brownian
particle. Since the worm can crawl both forward
and backward, the phase dynamics is minimally a second
order system,

$$
\frac{d\omega}{dt} = F(\omega, \phi) + \sigma(\omega, \phi)\,\eta(t), \qquad (3)
$$

where $\omega = d\phi/dt$ is the phase velocity and $\eta(t)$ is the noise (a
random component of the phase acceleration not related
to the current state of the worm), normalized so that
$\langle\eta(t)\eta(t')\rangle = \delta(t - t')$. As explained in Ref [15], we can
recover the "force" $F(\omega, \phi)$ and the local noise strength
$\sigma(\omega, \phi)$ from the raw data, so no further "modeling" is
required.

FIG. 3. Worm behavior in the eigenworm coordinates. (a)
Amplitudes along the first two eigenworms oscillate, with
nearly constant amplitude but time varying phase
$\phi = \tan^{-1}(a_2/a_1)$. The shape coordinate $\phi(t)$ captures the phase
of the locomotory wave moving along the worm's body. (b)
The phase dynamics from Eq (3) reveals attracting trajectories
in worm motion: forward and backward limit cycles
(white lines), and two instantaneous pause states (white circles).
Colors denote the basins of attraction for each attracting
trajectory. (c) In an experiment in which the worm receives
a weak thermal impulse at time $t = 0$, we use the
basins of attraction of (b) to label the instantaneous state of
the worm's behavior and compute the time dependent probability
that a worm is in either of the two pause states. The
pause states uncover an early-time stereotyped response to
the thermal impulse. (d) The probability density of the phase
(plotted as $\log P(\phi|t)$), illustrating stereotyped reversal trajectories
consistent with a noise-induced transition from the
forward state. Trajectories were generated using Eq (3) and
aligned to the moment of a spontaneous reversal at $t = 0$.

Leaving aside the noise, Eq (3) describes a dynamical
system in which there are multiple attracting trajectories
(Fig 3b): two limit cycle attractors corresponding
to forward and backward crawling (white lines) and two
pause states (white circles) corresponding to an instantaneous
freeze in the posture of the worm. Thus, underneath
the continuous, stochastic dynamics we find four
discrete states which correspond to well defined classes of
behavior. We emphasize that these behavioral classes are
emergent: there is nothing discrete about the phase time
series $\phi(t)$, nor have we labelled the worm's motion by
subjective criteria. While forward and backward crawling
are obvious behavioral states, the pauses are more
subtle. Exploring the worm's response to gentle thermal
stimuli, one can see that there is a relatively high
probability of a brief sojourn in one of the pause states
(Fig 3c). Thus, by identifying the attractors, and the
natural time scales of transitions between them, we uncover
a more reliable component of the worm's response
to sensory stimuli [15].

The noise term generates small fluctuations around
the attracting trajectories, but more dramatically drives
transitions among the attractors, and these transitions
are predicted to occur with stereotyped trajectories [20].
In particular, the Langevin dynamics in Eq (3) predict
spontaneous transitions between the attractors that correspond
to forward and backward motion. To quantify
this prediction, we run long simulations of the dynamics,
choose moments in time when the system is near
the forward attractor ($0.1 < d\phi/dt < 0.6$ cycles/s), and
then compute the probability that the trajectory has
not yet reversed (a reversal being $d\phi/dt < 0$) after a time $\tau$ following this
moment. If reversals are rare, this survival probability
should decay exponentially, $P(\tau) = \exp(-\tau/\langle\tau\rangle)$, and
this is what we see, with the predicted mean time to reverse
$\langle\tau\rangle = 15.7 \pm 2.1$ s, where the error reflects variations
across an ensemble of worms.
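
The survival analysis just described can be sketched as follows. This Python example is a minimal illustration, assuming that a phase-velocity time series (simulated or measured) is already available as an array; the function names are hypothetical, the thresholds are the ones quoted in the text, and start moments that never reach a reversal before the record ends are simply dropped, which is a simplification.

```python
import numpy as np

def times_to_reversal(omega, dt, lo=0.1, hi=0.6):
    """Waiting times from 'forward' moments to the next reversal.

    omega: phase velocity samples (cycles/s); a reversal is omega < 0.
    Start moments with no later reversal in the record are discarded.
    """
    omega = np.asarray(omega, dtype=float)
    next_neg = np.full(omega.size, -1)
    nxt = -1
    # Scan backwards so each sample knows the index of the next reversal.
    for i in range(omega.size - 1, -1, -1):
        if omega[i] < 0:
            nxt = i
        next_neg[i] = nxt
    starts = np.flatnonzero((omega > lo) & (omega < hi) & (next_neg >= 0))
    return (next_neg[starts] - starts) * dt

def survival(times, taus):
    """Fraction of waiting times that exceed each value in taus."""
    times = np.asarray(times, dtype=float)
    return np.array([(times > tau).mean() for tau in taus])

# Tiny hand-checkable example (hypothetical values, dt = 1 s).
omega = [0.3, 0.3, -0.1, 0.3, -0.2, 0.3]
times = times_to_reversal(omega, dt=1.0)
curve = survival(times, [0.0, 1.0, 2.0])
print(times, curve)
```

If the survival curve is exponential, the mean of the waiting times is a direct estimate of the mean time to reversal, and a plot of the log of the survival curve against time should be close to a straight line.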

We next examine the real trajectories of the worms,
performing the same analysis of reversals by measuring
the survival probability in the forward crawling state. We
find that the data obey an exponential distribution, as
predicted by the model, and the experimental mean time
to reversal is $\langle\tau_{\rm data}\rangle = 16.3 \pm 0.3$ s. This observed reversal
rate agrees with the model predictions within error bars,
and this corresponds to a precision of ~4%, which is
quite surprising. It should be remembered that we make
our model of the dynamics by analyzing how the phase
and phase velocity at the time $t$ evolve into phase and
phase velocity at time $t + dt$, where the data determine
$dt = 1/32$ s. Once we have the stochastic dynamics, we
can use them to predict the behavior on long time
scales. While we define our model on the timescale of a
single video frame ($dt$), behavioral dynamics emerge that
are nearly three orders of magnitude longer ($\langle\tau\rangle/dt \sim 500$),
with no adjustable parameters [20].

In this model, reversals are noise driven transitions
between attractors, in much the same way that chemical
reactions are thermally driven transitions between
attractors in the space of molecular structures [21]. In
the low noise limit, the trajectories that carry the system
from one attractor to another become stereotyped
[22]. Thus, the trajectories that allow the worm to escape
from the forward crawling attractor are clustered
around prototypical trajectories, and this is seen both in
the simulations (Fig 3d) and in the data [20].

In fact, many organisms, from bacteria to humans, ex- 
hibit discrete, stereotyped motor behaviors. A common 
view is that these behaviors are stereotyped because they 
are triggered by specific commands, and in some cases
we can even identify "command neurons" whose activity
provides the trigger [23]. In the extreme, discreteness
and stereotypy of the behavior reduce to the discreteness
and stereotypy of the action potentials generated
by the command neurons, as with the escape behaviors
in fish triggered by spiking of the Mauthner cell [24].
But the stereotypy of spikes itself emerges from the con- 
tinuous dynamics of currents, voltages and ion channel 
populations [53] [5S] . The success here of the stochastic 
phase model in predicting the observed reversal charac- 
teristics of C. elegans demonstrates that stereotypy can 
also emerge directly from the dynamics of the behavior 
itself. 



III. MAXIMUM ENTROPY MODELS OF 
NATURAL NETWORKS 

Much of what happens in living systems is the result 
of interactions among large networks of elements — many 
amino acids interact to determine the structure and func- 
tion of proteins, many genes interact to define the fates 
and states of cells, many neurons interact to represent 
our perceptions and memories, and so on. Even if each 
element in a network takes only two values, the number
of possible states in a network of $N$ elements is $2^N$,
which easily becomes larger than any realistic experiment
(or lifetime!) can sample, the same dimensionality problem
that we encountered in movement behavior. Indeed,
a lookup table for the probability of finding a network in
any one state has $\sim 2^N$ parameters, and this is a disaster.
To make progress we search for a simpler class of models 
with many fewer parameters. 

We seek an analysis of living networks that leverages 
increasingly high-throughput experimental methods such 
as the recording from large numbers of neurons simultaneously.
These experiments provide, for example, reliable
information about the correlations between the action 
potentials generated by pairs of neurons. In a similar 
spirit, we can measure the correlations between amino 
acid substitutions at different sites across large families 
of proteins. Can we use these pairwise correlations to say 
anything about the network as a whole? While there are 
an infinite number of models that can generate a given 
pattern of pairwise correlations, there is a unique model 
that reproduces the measured correlations and adds no 
additional structure. This minimally structured model is 
the one that maximizes the entropy of the system [27] , in 
the same way that the thermal equilibrium (Boltzmann) 
distribution maximizes the entropy of a physical system 
given that we know its average energy. 



Letters in words 

To see how the maximum entropy idea works, we ex- 
amine an example where we have some intuition for the 
states of the network. Consider the spelling of four letter
English words [28], where at positions $i = 1, 2, 3, 4$
in the word we can choose a variable $x_i$ from 26 possible
values. A word is then represented by the combination
$\mathbf{x} = \{x_1, x_2, x_3, x_4\}$, and we can sample the distribution
of words, $P(\mathbf{x})$, by looking through a large corpus of
writings, for example the collected novels of Jane Austen 
[2"5] . If we don't know anything about the distribution 
of states in this network, we can maximize the entropy 
of the distribution $P(\mathbf{x})$ by having all possible combinations
of letters be equally likely, and then the entropy is
$S = -\sum_{\mathbf{x}} P(\mathbf{x}) \log_2 P(\mathbf{x}) = 4 \times \log_2(26) = 18.8$ bits. But, in
actual English words, not all letters occur equally often, 
and this bias in the use of letters is different at different 
positions in the word. If we take these "one letter" statis- 
tics into account, the maximum entropy distribution is 
the independent model, 



$$
P^{(1)}(\mathbf{x}) = P_1(x_1)\,P_2(x_2)\,P_3(x_3)\,P_4(x_4), \qquad (4)
$$



where $P_i(x)$ is the easily measured probability of finding
letter $x$ in position $i$. Taking account of actual letter
frequencies lowers the entropy to $S_1 = 14.083 \pm 0.001$ bits,
where the small error bar is derived from sampling across
the $\sim 10^6$ word corpus.
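
To make the independent-letter calculation concrete, the Python sketch below computes the entropy of each letter position and sums them, which is the entropy of a product distribution of the form in Eq (4). The toy word list is a hypothetical stand-in for a real corpus, so the resulting number is only illustrative and is unrelated to the value quoted above.

```python
from collections import Counter
import numpy as np

def positional_entropies(words):
    """Entropy (bits) of the letter distribution at each position."""
    length = len(words[0])
    entropies = []
    for i in range(length):
        counts = np.array(list(Counter(w[i] for w in words).values()), dtype=float)
        p = counts / counts.sum()
        entropies.append(float(-(p * np.log2(p)).sum()))
    return entropies

# With no information at all, every four letter string is equally likely.
uniform_bits = 4 * np.log2(26)

# Hypothetical toy corpus; each word counts once per occurrence.
toy_corpus = ["that", "then", "this", "with", "they", "have"]
s1 = sum(positional_entropies(toy_corpus))
print(f"uniform: {uniform_bits:.1f} bits, independent model on toy corpus: {s1:.2f} bits")
```

For independent positions the entropies add, which is why the sum over positions gives the entropy of the whole model.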

The independent letter model defined by $P^{(1)}$ is clearly
wrong: the most likely words are 'thae', 'thee' and 'teae.'
Can we build a better approximation to the distribution
of words by including correlations between pairs of letters?
The difficulty is that now there is no simple formula
like Eq (4) which connects the maximum entropy distribution
for $\mathbf{x}$ to the measured distributions of letter pairs
$(x_i, x_j)$. Instead we know analytically the form of the
distribution,

$$
P^{(2)}(\mathbf{x}) = \frac{1}{Z} \exp\left[ -\sum_{i>j} V_{ij}(x_i, x_j) \right], \qquad (5)
$$



where all of the coefficients $V_{ij}(x, x')$ have to be chosen
to reproduce the observed correlations between pairs of
letters. This is complicated, but much less complicated
than it could be: by matching all the pairwise correlations
we are fixing $\sim 6 \times (26)^2$ parameters, which is vastly
smaller than the $(26)^4$ possible combinations of letters.
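
The parameter counting in this paragraph is easy to check directly. The short Python sketch below simply evaluates the two numbers; the variable names are hypothetical.

```python
from math import comb

n_letters, n_positions = 26, 4

# One table of 26 x 26 entries for each pair of positions.
n_pairs = comb(n_positions, 2)
pairwise_parameters = n_pairs * n_letters**2

# A full lookup table needs one probability per possible four letter string.
possible_words = n_letters**n_positions

print(n_pairs, pairwise_parameters, possible_words)
```

The pairwise description therefore uses roughly one hundred times fewer numbers than a full lookup table, and the gap widens rapidly for longer words or larger networks.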

The model in Eq (5) has exactly the form of the
Boltzmann distribution for a physical system in thermal
equilibrium, where the letters "interact" through a potential
energy $V_{ij}(x, x')$. The essential simplification is
that there are no explicit interactions among triplets or
quadruplets: all the higher order correlations must be
consequences of the pairwise interactions. We know that
in many physical systems this is a good approximation,
that is $P \approx P^{(2)}$. However, the rules of spelling (e.g.,
i before e except after c) seem to be in explicit conflict
with such a simplification. Nonetheless, when we apply
the model in Eq (5) to English words, we find reasonable
phonetic constructions. Here we leave aside the problem
of how one finds the potentials $V_{ij}$ from the measured
correlations among pairs of letters (see Refs [3TJH56"]) and
discuss the results.
FIG. 4. For networks of neurons and letters, the pairwise maximum
entropy model provides an excellent approximation to
the probability of network states. In each case, we show the
Zipf plot for real data (blue) compared to the pairwise maximum
entropy approximation (red). Scale bars to the right of
each plot indicate the entropy captured by the pairwise model.
(a) Letters within four letter English words [28]. The maximum
entropy model also produces 'non-words' (inset, green
circles) that never appeared in the full corpus but nonetheless
contain realistic phonetic structure. (b) 10 neuron patterns
of spiking and silence in the vertebrate retina [37].



Once we construct a maximum entropy model of words
using Eq (5), we find that the entropy of the pairwise
model is $S_2 = 7.471 \pm 0.006$ bits, about half the entropy
of independent letters $S_1$. A rough way to think about
this result is that if letters were chosen independently,
there would be $2^{S_1} \sim 17,350$ possible four letter words.
Taking account of the pairwise correlations reduces this
vocabulary by a factor of $2^{S_1 - S_2} \sim 100$, down to
effectively ~178 words. In fact, the Jane Austen corpus
is large enough that we can estimate the true entropy
of the distribution of four letter words, and this is
$S_{\rm full} = 6.92 \pm 0.003$ bits. Thus the pairwise model captures
~92% of the entropy reduction relative to choosing
letters independently, and hence accounts for almost all
of the restriction in vocabulary provided by the spelling
rules and the varying frequencies of word usage. The
same result is obtained with other corpora, so this is not
a peculiarity of an author's style.
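
The numbers in this paragraph follow from the three quoted entropies by simple arithmetic, which the Python sketch below reproduces. Only the three entropy values come from the text; the variable names are hypothetical.

```python
# Entropies quoted in the text, in bits.
S1, S2, S_full = 14.083, 7.471, 6.92

effective_independent = 2 ** S1       # effective vocabulary for independent letters
effective_pairwise = 2 ** S2          # effective vocabulary for the pairwise model
vocabulary_reduction = 2 ** (S1 - S2) # factor gained by including pair correlations

# Fraction of the total entropy reduction (S1 - S_full) captured by pairs.
fraction_captured = (S1 - S2) / (S1 - S_full)

print(round(effective_independent), round(effective_pairwise),
      round(vocabulary_reduction), round(fraction_captured, 3))
```

The effective vocabulary sizes are just the entropies re-expressed as numbers of equally likely alternatives, so they should be read as rough guides rather than literal word counts.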

We can look more closely at the predictions of the max- 
imum entropy model in a "Zipf plot," ranking the words 
by their probability of occurrence and plotting probability
vs. rank, as in Fig 4a. The predicted Zipf plot almost
perfectly overlays what we obtain by sampling the cor- 
pus, although some weight is predicted to occur in words 
that do not appear in Austen's writing. Many of these 
are real words that she happened not to use, and others 
are perfectly pronounceable English even if they are not 
actually words. Thus, despite the complexity of spelling 
rules, the pairwise model captures a very large fraction 
of the structure in the network of letters. 



Spiking and silence in neural networks 

Maximum entropy models also provide a good approx- 
imation to the patterns of spiking in the neural network 
of the retina. In a network of neurons where the variable
$x_i$ marks the presence ($x_i = +1$) or absence ($x_i = -1$) of
an action potential from neuron i in a small window of
time, the state of the whole network is given by the pattern
of spiking and silence across the entire population of
neurons, $x = \{x_1, x_2, \cdots, x_N\}$. In the original example
of these ideas, Schneidman et al [37] looked at groups of
N = 10 nearby neurons in the vertebrate retina as it responded
to naturalistic stimuli, with the results shown in
Fig 4. Again we see that the pairwise model does an excellent
job, capturing ~ 90% or more of the reduction in
entropy, reproducing the Zipf plot, and even predicting 
the wildly varying probabilities of the particular patterns 
of spiking and silence (see Fig 2a of Ref [37]). 
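
For small networks the distribution defined by a pairwise model can be written out by brute force. The sketch below shows only the forward step, from given parameters to the probabilities of all 2^N patterns of spiking and silence; it does not fit the parameters to data. The sign convention for the energy, the function names, and the parameter values are our assumptions and are not taken from any experiment.

```python
import itertools
import math

def pairwise_energy(x, h, J):
    """Energy of state x (entries +1/-1) under fields h and couplings J.

    Sign convention (an assumption): P(x) is proportional to exp(-E(x)),
    with E(x) = -sum_i h[i]*x[i] - sum_{i<j} J[i][j]*x[i]*x[j].
    """
    n = len(x)
    e = -sum(h[i] * x[i] for i in range(n))
    e -= sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return e

def boltzmann_distribution(h, J):
    """Enumerate all 2**N states and return {state: probability}."""
    states = list(itertools.product((-1, 1), repeat=len(h)))
    weights = [math.exp(-pairwise_energy(x, h, J)) for x in states]
    Z = sum(weights)
    return {x: w / Z for x, w in zip(states, weights)}

def entropy_bits(p):
    """Shannon entropy, in bits, of a {state: probability} dictionary."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

# Hypothetical parameters for N = 3 neurons (not fitted to any data).
h = [-1.0, -1.0, -1.0]           # negative fields favor silence
J = [[0.0, 0.5, 0.0],
     [0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0]]            # only the upper triangle is used
P = boltzmann_distribution(h, J)
print(round(entropy_bits(P), 3), "bits")
```

Exhaustive enumeration is feasible only for small N, which is why comparisons of the full entropy with the pairwise approximation are restricted to small groups of neurons.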

The maximum entropy models discussed here are im- 
portant because they often capture a large fraction of the 
interactions present in natural networks while simultane- 
ously avoiding a combinatorial explosion in the number of 
parameters. This is true even in cases where interactions 
are strong enough so that independent (i.e. zero neuron- 
neuron correlation) models fail dramatically. Such an 
approach has also recently been used to show how net- 
work functions such as stimulus decorrelation and error 
correction reflect a trade-off between efficient consump- 
tion of finite neural bandwidth and the use of redundancy 
to mitigate noise [38].
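
To make the saving in parameters concrete, the following sketch compares the number of parameters in a pairwise model of N binary variables (N single-variable terms plus N(N-1)/2 pairwise terms) with the 2^N - 1 free probabilities of an unconstrained distribution. The function names are ours.

```python
def n_params_pairwise(n):
    """Fields plus pairwise couplings for n binary variables."""
    return n + n * (n - 1) // 2

def n_params_full(n):
    """Free probabilities of an arbitrary distribution over 2**n states."""
    return 2 ** n - 1

for n in (10, 40):
    print(n, n_params_pairwise(n), n_params_full(n))
```

For N = 10 this gives 55 parameters against 1023, and for N = 40 it gives 820 against roughly 10^12.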

As we look at larger networks, we can no longer com- 
pute the full distribution and thus we cannot directly 
compare the full entropy with its pairwise approximation.
We can, however, check many other predictions
and the maximum entropy model works well, at least to 
N = 40 [37, 39]. Related ideas have also been applied to
a variety of neural networks with similar findings [40-43]
(however, also see [44] for differences), which suggest that
the networks in the retina are typical of a larger class of 
natural ensembles. 



Metastable states 

As we have emphasized in discussing Eq (5), maximum
entropy models are exactly equivalent to Boltzmann
distributions, and thus define an effective "energy" for 
each possible configuration of the network. States of high 
probability correspond to low energy, and we can think of 
an "energy landscape" over the space of possible states, 
in the spirit of the Hopfield model for neural networks 
[45]. Once we construct this landscape, it is clear that
some states are special because they sit at the bottom of 
a valley — at local minima of the energy. For networks of 
neurons, these special states are such that flipping any 
single bit in the pattern of spiking and silence across the 
population generates a state with lower probability. For 
words, a local minimum of the energy means that chang- 
ing any one letter produces a word of lower probability. 
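
The definition of a local minimum for a network of binary variables translates directly into a test on single bit flips. The sketch below assumes a pairwise energy with the sign convention P(x) proportional to exp(-E(x)), and it uses a small hypothetical network with uniform positive couplings so that the result can be checked by hand. None of the parameter values come from data.

```python
import itertools

def energy(x, h, J):
    """Pairwise energy; assumed convention: P(x) proportional to exp(-E(x))."""
    n = len(x)
    pair = sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return -sum(h[i] * x[i] for i in range(n)) - pair

def is_local_minimum(x, h, J):
    """True if every single bit flip strictly raises the energy."""
    e0 = energy(x, h, J)
    for i in range(len(x)):
        flipped = x[:i] + (-x[i],) + x[i + 1:]
        if energy(flipped, h, J) <= e0:
            return False
    return True

def local_minima(h, J):
    """Exhaustively list the local minima; feasible only for small N."""
    states = itertools.product((-1, 1), repeat=len(h))
    return [x for x in states if is_local_minimum(x, h, J)]

# Hypothetical toy network: 4 units, weak bias toward silence, uniform couplings.
n = 4
h = [-0.5] * n
J = [[1.0 if j > i else 0.0 for j in range(n)] for i in range(n)]
minima = local_minima(h, J)
print(minima)
```

In this toy network the two minima are the all-silent and all-active states. For N = 40 exhaustive enumeration is no longer possible, and minima would have to be located by descending from observed or sampled states instead.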

FIG. 5. Metastable states in the energy landscape of networks
of neurons and letters. (a) Probability that the 40 neuron system
is found within the basin of attraction of each nontrivial
locally stable state $G_\alpha$ as a function of time during the 145
repetitions of the stimulus movie. The inset shows the state
of the entire network at the moment it enters the basin of $G_5$,
on 60 successive trials. (b) The energy landscape ($\epsilon = -\ln P$)
in the maximum entropy model of letters in words. We order
the basins in the landscape by decreasing probability of their
ground states, and show the low energy excitations in each
basin.

The picture of an energy landscape on the states of
a network may seem abstract, but the local minima
can (sometimes surprisingly) have functional meaning,
as shown in Fig 5. In the case of the retina, a maximum
entropy model was constructed to describe the states of
spiking and silence in a population of N = 40 neurons 
as they respond to naturalistic inputs, and this model 
predicts the existence of several non-trivial local minima
[37, 39]. Importantly, this analysis does not make
any reference to the visual stimulus. But if we play the 
same stimulus movie many times, we see that the system 
returns to the same valleys or basins surrounding these 
special states, even though the precise pattern of spiking 
and silence is not reproduced from trial to trial (Fig 5a).
This suggests that the response of the population can be 
summarized by which valley the system is in, with the de- 
tailed spiking pattern being akin to variations in spelling. 
To reinforce this analogy, we can look at the local minima 
of the energy landscape for four letter words. 
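
Summarizing a response by "which valley the system is in" requires a rule for assigning each pattern to a basin. The text does not specify the rule, so the sketch below assumes one common choice: follow the single bit flip that lowers the energy most, and repeat until no flip lowers it. The energy convention, the function names, and the toy parameters are likewise our assumptions.

```python
import itertools
import math

def energy(x, h, J):
    """Pairwise energy; assumed convention: P(x) proportional to exp(-E(x))."""
    n = len(x)
    pair = sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return -sum(h[i] * x[i] for i in range(n)) - pair

def descend(x, h, J):
    """Follow the steepest single-bit flip downhill until none lowers the energy."""
    x = tuple(x)
    while True:
        neighbors = [x[:i] + (-x[i],) + x[i + 1:] for i in range(len(x))]
        best = min(neighbors, key=lambda y: energy(y, h, J))
        if energy(best, h, J) >= energy(x, h, J):
            return x
        x = best

def basin_probabilities(h, J):
    """Total probability of each basin, labeled by the state at its bottom."""
    mass = {}
    for x in itertools.product((-1, 1), repeat=len(h)):
        ground = descend(x, h, J)
        mass[ground] = mass.get(ground, 0.0) + math.exp(-energy(x, h, J))
    Z = sum(mass.values())
    return {g: m / Z for g, m in mass.items()}

# Hypothetical toy network: 4 units, weak bias toward silence, uniform couplings.
n = 4
h = [-0.5] * n
J = [[1.0 if j > i else 0.0 for j in range(n)] for i in range(n)]
basins = basin_probabilities(h, J)
for ground, p in basins.items():
    print(ground, round(p, 3))
```

Applied to recorded patterns rather than to all states, the same `descend` step maps each observed pattern of spiking and silence to a basin label, which is the kind of summary plotted against time in Fig 5a.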

In the maximum entropy model for letters, we find 136
local minima, of which the 10 most likely are shown
in Fig 5b. More than 2/3 of the entropy in the full dis- 
tribution of words is contained in the distribution over 
these valleys, and in most of these valleys there is a large 
gap between the bottom of the basin (the most likely 
word) and the next most likely word. Thus, the entropy 
of the letter distribution is dominated by states which 
are not connected to each other by single letter substi- 
tutions, perhaps reflecting a pressure within language to 
communicate without confusion. 
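
The same construction applies to words, where the elementary move is a single letter substitution. In the sketch below a word is a local minimum of the energy if every word that differs from it by one letter has strictly lower probability. The probability table is a hypothetical toy example, not output of the model, and words missing from the table are assumed to have probability zero.

```python
import string

def neighbors(word):
    """Yield every string that differs from word by exactly one letter."""
    for i, original in enumerate(word):
        for letter in string.ascii_lowercase:
            if letter != original:
                yield word[:i] + letter + word[i + 1:]

def word_local_minima(prob):
    """Words whose single-letter neighbors all have strictly lower probability."""
    minima = []
    for word, p in prob.items():
        if all(prob.get(other, 0.0) < p for other in neighbors(word)):
            minima.append(word)
    return minima

# Hypothetical probabilities for a handful of four letter words (illustration only).
toy_prob = {"that": 0.30, "than": 0.10, "then": 0.15, "them": 0.12,
            "with": 0.20, "wish": 0.05, "have": 0.08}
word_minima = word_local_minima(toy_prob)
print(word_minima)
```

In this toy table "than" falls into the basin of "that" and "them" into the basin of "then", while "that" and "then" are distinct minima because they differ by two letters.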



IV. DISCUSSION 

Understanding a complex system necessarily involves 
some sort of simplification. We have emphasized that, 
with the right data, there are mathematical methods 
which allow a system to "tell us" what sort of simplification
is likely to be useful.

Dimensionality reduction is perhaps the most obvious 
method of simplification — a direct reduction in the num- 
ber of variables that we need to describe the system. The 
examples of C. elegans crawling and smooth pursuit eye 
movements are compelling because the reduction is so 
complete, with just three or four coordinates capturing 
~ 95% of all the variance in behavior. In each case, the 
low dimensionality of our description provides functional 
insight, whether into origins of stereotypy or the possi- 
bility of optimal performance. The idea of dimensional- 
ity reduction in fact has a long history in neuroscience, 
since receptive fields and feature selectivity are naturally 
formalized by saying that neurons are sensitive only to 
a limited number of dimensions in stimulus space [46-49].
More recently it has been emphasized that quantitative
models of protein/DNA interactions are equivalent
to the hypothesis that proteins are sensitive only to a limited
number of dimensions in sequence space [50, 51].

The maximum entropy approach achieves a similar 
simplification for networks; it searches for simplification 
not in the number of variables, but in the number of pos- 
sible interactions among these variables. The example of 
letters in words shows how this simplification retains the 
power to describe seemingly combinatorial patterns. For 
both neurons and letters, the mapping of the maximum 
entropy model onto an energy landscape points to special 
states of the system that seem to have functional signif- 
icance. There is an independent stream of work which 
emphasizes the sufficiency of pairwise correlations among
amino acid substitutions in defining functional families
of proteins [52, 53], and this is equivalent to the maximum
entropy approach [53]; explicit construction of the
maximum entropy models for antibody diversity again
points to the functional importance of the metastable
states [55].

Although we have phrased the ideas of this paper essentially
as methods of data analysis, the repeated successes
of mathematically equivalent models (dimensionality
reduction in movement and maximum entropy in networks)
encourage us to seek unifying theoretical principles
that give rise to behavioral simplicity. Finding such
a theory, however, will only be possible if we observe
behavior in sufficiently unconstrained contexts so that
simplicity is something we discover rather than impose.